INTRODUCTION


® Developers may “embed” or “clone” code
from 3rd-party sources

■ Static linking 

■ Maintaining an internal copy of a library.

■ Forking a library. 

® Lots of examples 

■ XML parsing -> libxml in various programs 

■ Image processing -> libpng in Firefox 

■ Networking -> OpenSSL in Cisco IOS

■ Compression -> zlib everywhere 


EMBEDDING IS BAD PRACTICE


® Linux distribution policies generally disallow it (image
below).

® It still happens. 



® Multiple versions of packages now exist.
® Each copy needs patches from upstream. 


® Copies become insecure over time from 
unapplied patches. 








THE MANUAL APPROACH 

® Scan binaries for version strings. 


® Done in 2005 on mass scale for zlib in Debian 
Linux. 
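The scan can be approximated with regular expressions run over a binary's raw bytes. This is a minimal sketch, not the tooling used in the Debian audit: the pattern table and the function name `scan_for_versions` are illustrative assumptions, modelled on the header definitions shown on this slide.

```python
import re

# Hypothetical patterns modelled on the version strings shown on this slide.
VERSION_PATTERNS = {
    "bzip2": re.compile(rb"(\d+\.\d+\.\d+), \d{1,2}-[A-Z][a-z]{2}-\d{4}"),
    "libtiff": re.compile(rb"LIBTIFF, Version (\d+\.\d+\.\d+)"),
    "libpng": re.compile(rb"libpng version (\d+\.\d+\.\d+)"),
}


def scan_for_versions(blob: bytes) -> dict:
    """Return {library: version} for every pattern found in a binary blob."""
    found = {}
    for library, pattern in VERSION_PATTERNS.items():
        match = pattern.search(blob)
        if match:
            found[library] = match.group(1).decode("ascii")
    return found


# Stand-in for the bytes of a real binary.
sample = b"\x00\x01 libpng version 1.2.27 - April 29, 2008\n\x00LIBTIFF, Version 3.8.2\n"
print(scan_for_versions(sample))
```

For a real file, the same function would be called on `open(path, "rb").read()`. The approach only works when the embedded copy keeps its version string, which is one reason it does not scale beyond a handful of libraries.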


bzlib_private.h:#define BZ_VERSION "1.0.5, 10-Dec-2007"


tiffvers.h:#define TIFFLIB_VERSION_STR "LIBTIFF, Version 
3.8.2\nCopyright (c) 1988-1996 Sam Leffler\nCopyright (c) 
1991-1996 Silicon Graphics, Inc." 


png.h:#define PNG_HEADER_VERSION_STRING \ 

" libpng version 1.2.27 - April 29, 2008\n" 





MOTIVATION 

® 10,000 - 20,000 packages in Linux distros. 


® Debian tracks over 420 libraries (see below). 
® Most distros don’t track at all.


® How many vulnerabilities are there? 


® How to automate? 


php-htmlpurifier

- mahara 1.2.5-1 (embed) 

- knowledgeroot 0.9.9.5-5 (embed) 

- moodle <unfixed> (embed) 


- gnome-peercast <removed> (embed)
[etch] - gnome-peercast <unfixed> (embe 
PROBLEM DEFINITION

® Find package code re-use in sources. 

® Infer vulnerabilities caused by out-of-date code.




OUR APPROACH


® Consider code re-use detection a binary
classification problem: 

■ Do packages A and B share code? Yes or no? 

® Features for classification: 

■ Common filenames 

■ Hashes 

■ Fuzzy content 


STATISTICAL CLASSIFICATION

® Classification assigns classes to objects. 


® Supervised learning. 


® Unsupervised learning. 



Class 1 - Spam 


Class 2 - Not Spam 





FEATURE EXTRACTOR 


® Feature vector 


1. N_Filenames_A 

2. N_Filenames_Source_A 

3. N_Filenames_B 

4. N_Filenames_Source_B 

5. N_Common_Filenames 

6. N_Common_Similar_Filenames 

7. N_Common_FilenameHashes 

8. N_Common_FilenameHash80 

9. N_Common_ExactFilenameHash 

10. N_Score_of_Common_Filename 

11. N_Score_of_Common_Similar_Filename 

12. N_Score_of_Common_FilenameHash 

13. N_Score_of_Common_FilenameHash80 

14. N_Score_of_Common_ExactFilenameHash80 

15. N_Data_Common_Filenames 

16. N_Data_Common_Similar_Filenames 

17. N_Data_Common_FilenameHashes 

18. N_Data_Common_FilenameHash80 

19. N_Data_Common_ExactFilenameHash 

20. N_Data_Score_of_Common_Filename 

21. N_Data_Score_of_Common_Similar_Filename

22. N_Data_Score_of_Common_FilenameHash 

23. N_Data_Score_of_Common_FilenameHash80 

24. N_Data_Score_of_Common_ExactFilenameHash80 

25. N_Common_ExactHash 

26. N_Common_DataExactHash 




NUMBER OF COMMON FILENAMES 

® Source and data. 

® Normalize names. 


Source file extensions: c, cpp, cxx, cc, php, inc, java, py, rb, js, pl, pm, ml, mli, lua


expat-2.0.1/lib        tla-1.3.5+dfsg/src/expat/lib/

amigaconfig.h
ascii.h                ascii.h
asciitab.h             asciitab.h
expat.dsp              expat.dsp
expat_external.h       expat_external.h
expat.h                expat.h
expat_static.dsp       expat_static.dsp
expatw.dsp             expatw.dsp
expatw_static.dsp      expatw_static.dsp
iasciitab.h            iasciitab.h
internal.h             internal.h
latin1tab.h            latin1tab.h
libexpat.def           libexpat.def
libexpatw.def          libexpatw.def
macconfig.h            macconfig.h
Makefile.MPW           Makefile.MPW
nametab.h              nametab.h
utf8tab.h              utf8tab.h
winconfig.h            winconfig.h
xmlparse.c             xmlparse.c
xmlrole.c              xmlrole.c
xmlrole.h              xmlrole.h
xmltok.c               xmltok.c
xmltok.h               xmltok.h
xmltok_impl.c          xmltok_impl.c
xmltok_impl.h          xmltok_impl.h
xmltok_ns.c            xmltok_ns.c






NUMBER OF SIMILAR FILENAMES 

® Edit distance between filenames. 

® Similarity >= 85% 

$$\mathrm{similarity}(s,t) = 1 - \frac{\mathrm{edit\_dist}(s,t)}{\max(\mathrm{len}(s), \mathrm{len}(t))}$$
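The similarity measure can be sketched as follows. This assumes that "edit distance" means Levenshtein distance, which the slide does not state explicitly; the function names are illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(s: str, t: str) -> float:
    """1 - edit_dist(s, t) / max(len(s), len(t)), as on the slide."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty names are treated as identical
    return 1.0 - edit_distance(s, t) / longest


def is_similar(s: str, t: str, threshold: float = 0.85) -> bool:
    """Apply the 85% cut-off used for the similar-filename features."""
    return similarity(s, t) >= threshold


print(is_similar("xmltok_impl.c", "xmltok_impl.h"))  # one substitution in 13 characters
print(is_similar("Makefile", "Makefile.ca"))         # three insertions in 11 characters
```

Under this definition a one-character difference in a 13-character name scores 12/13, above the cut-off, while `Makefile` against `Makefile.ca` scores 8/11 and falls below it.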



NUMBER OF FILES WITH IDENTICAL
OR SIMILAR CONTENT

® Use fuzzy hashing (ssdeep). 


® Number of identical hashes. 

® Number of > 80% similar hashes. 
® Number of > 0% similar hashes. 


ssdeep,1.0--blocksize:hash:hash,filename

96:KQhaGCVZGhr83h3bc0ok3892m12wzgnH5w2pw+sxNEI58:FIVkH4x73h39LH+2w+sxaD,"config.h" 
96:MD9fHjsEuddrg31904l8bgx5ROg2MQZHZqpAlycowOsexbHDbk:MJwz/l2PqGqqbr2yk6pVgrwPV,"INSTALL" 
96:EQOJvOl4ab3hhiNFXc4wwcweomr0cNJDBoqXjmAHKX8dEt001nfEhVluX0dDcs:3mzpAsZpprbshfu3oujjdEN 
dp21,"README" 
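Output in the format above can be parsed and compared for exact matches with the standard library alone. This is a sketch under assumptions: the function names are illustrative, and the similarity scoring itself (the 80% and 0% features) is not reproduced because it needs the ssdeep comparison routine.

```python
import csv
import io


def parse_ssdeep_output(text: str) -> dict:
    """Map filename -> (blocksize, hash1, hash2) from ssdeep's CSV-style output."""
    rows = csv.reader(io.StringIO(text))
    next(rows, None)  # skip the "ssdeep,1.0--blocksize:hash:hash,filename" header
    signatures = {}
    for row in rows:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        blocksize, hash1, hash2 = row[0].split(":", 2)
        signatures[row[1]] = (int(blocksize), hash1, hash2)
    return signatures


def count_identical(sigs_a: dict, sigs_b: dict) -> int:
    """Number of signatures in package A that occur verbatim in package B."""
    in_b = set(sigs_b.values())
    return sum(1 for signature in sigs_a.values() if signature in in_b)


# Shortened, made-up signatures; real ssdeep hashes are much longer.
HEADER = "ssdeep,1.0--blocksize:hash:hash,filename\n"
package_a = parse_ssdeep_output(HEADER + '96:abc:de,"config.h"\n48:xyz:uv,"README"\n')
package_b = parse_ssdeep_output(HEADER + '96:abc:de,"src/config.h"\n48:qqq:rr,"README"\n')
print(count_identical(package_a, package_b))
```

Matching on the whole signature ignores the filename, so a file that was moved or renamed inside the embedding package still counts.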



SCORING FILENAMES 

® README filenames less important.
® libpng.c more important.


® Score filenames using ‘inverse document
frequency.’

$$\mathrm{idf}(t,D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$


® Sum scores of matching filenames.
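A small worked example of the scoring, treating each package as a "document" and each filename as a "term". The package names and contents are made up, and the natural logarithm is an assumption because the slide does not give a base.

```python
import math
from collections import Counter


def idf_scores(packages: dict) -> dict:
    """packages maps package name -> set of filenames; returns filename -> idf."""
    n_packages = len(packages)
    doc_freq = Counter()
    for filenames in packages.values():
        doc_freq.update(set(filenames))  # count each filename once per package
    return {name: math.log(n_packages / df) for name, df in doc_freq.items()}


def match_score(files_a, files_b, idf: dict) -> float:
    """Sum the idf scores of the filenames two packages have in common."""
    return sum(idf[name] for name in set(files_a) & set(files_b))


# Hypothetical corpus of four packages.
packages = {
    "libpng": {"README", "libpng.c", "png.h"},
    "viewer": {"README", "libpng.c", "main.c"},
    "editor": {"README", "edit.c"},
    "shell": {"README", "sh.c"},
}
idf = idf_scores(packages)
print(idf["README"], idf["libpng.c"])
print(match_score(packages["libpng"], packages["viewer"], idf))
```

`README` appears in every package and scores log(4/4) = 0, so it adds nothing to a match. `libpng.c` appears in two of the four and scores log 2.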



MATCHING FILENAMES BETWEEN
PACKAGES

® Which similar filenames to match? 


® Each matching has a cost - the filename score. 


® Choose matchings to maximize sum of costs. 


[Figure: candidate matchings between the filenames Makefile, Makefile.ca, png43.c, png.h and README in two packages]













THE ASSIGNMENT PROBLEM 


® Given two sets, A and T, of equal size, together with a weight function C: A × T → R. Find a
bijection f: A → T such that the cost function:

$$\sum_{a \in A} C(a, f(a))$$

is optimal.


® Known in combinatorial optimisation as ‘the 
assignment problem.’ 

® Solved optimally in cubic time. 


® Greedy solution is faster. 
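The trade-off between the optimal and the greedy solution can be seen on a tiny instance. This is a sketch only: the exhaustive search stands in for a cubic-time solver such as the Hungarian algorithm, and the weight matrix is made up.

```python
from itertools import permutations


def optimal_assignment(weights):
    """Exhaustive search over all bijections; only feasible for tiny inputs."""
    n = len(weights)
    best = max(permutations(range(n)),
               key=lambda perm: sum(weights[i][perm[i]] for i in range(n)))
    return list(enumerate(best))


def greedy_assignment(weights):
    """Repeatedly take the heaviest remaining pair whose endpoints are both free."""
    edges = sorted(((w, i, j)
                    for i, row in enumerate(weights)
                    for j, w in enumerate(row)), reverse=True)
    used_rows, used_cols, matching = set(), set(), []
    for _, i, j in edges:
        if i not in used_rows and j not in used_cols:
            matching.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(matching)


def total(weights, matching):
    """Sum of the weights of the chosen pairs."""
    return sum(weights[i][j] for i, j in matching)


# Hypothetical filename scores: rows are files in A, columns are files in T.
weights = [[5, 4],
           [4, 0]]
print(total(weights, optimal_assignment(weights)))
print(total(weights, greedy_assignment(weights)))
```

Greedy takes the single best pair first (weight 5) and is then forced into the zero-weight pair, for a total of 5. The optimal matching gives up that pair and reaches 8. Greedy is faster but can miss the maximum.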





FEATURE SELECTION 

® Not all features are important.


® Feature ranking.

1. Feature1
2. Feature2
3. Feature3

1. Feature3 (0.80)
2. Feature1 (0.60)
3. Feature2 (0.01)


® Subset selection.

1. Feature1
2. Feature2
3. Feature3

1. Feature1
2. Feature2


® We chose not to use it.






CLASSIFICATION 

® Consider feature vectors as N-dimensional 
points. 


® Linear classifiers. 

® Non-linear classifiers.



® Decision trees. 







SCALING THE ANALYSIS


MULTICORE

® Speed up clone detection on a package.


® OpenMP.















CLUSTERING

® Open MPI. 


® Single job is clone detection on package. 


® Slaves consume jobs. 



® Embarrassingly parallel. 














RUNNING THE ANALYSIS 

® 4 Node Amazon EC2 Cluster 

■ Dual CPU 

■ 8 cores per CPU 

■ 88 EC2 compute units 

■ 60.5G memory per node 

® Clone detection on embedded libs known by 
Debian. 


® Store the results for later use. 


INFERRING SECURITY 
VULNERABILITIES 


PACKAGE CLONE DETECTION USE CASE


® By package 


National Cyber-Alert System 

Vulnerability Summary for CVE-2010-0205 


Original release date: 03/03/2010 
Last revised: 11/18/2010
Source: US-CERT/NIST

Overview

The png_decompress_chunk function in pngrutil.c in libpng 1.0.x before
1.0.53, 1.2.x before 1.2.43, and 1.4.x before 1.4.1 does not properly
handle compressed ancillary-chunk data that has a disproportionately large
uncompressed representation, which allows remote attackers to cause a
denial of service (memory and CPU consumption, and application hang) via a
crafted PNG file, as demonstrated by use of the deflate compression
method on data composed of many occurrences of the same character,
related to a "decompression bomb" attack.


$ clonewise query-cache libpng

# The following package clones are tracked in the embedded-code-copies 

# database. They have not been fixed. 

# 

libpng cloned_in_source ia32-libs <unfixed>








STANDARDIZATION EFFORTS 




Summary: Off-by-one error in the __opiereadrec
function in readrec.c in libopie in OPIE 2.4.1-test1
and earlier, as used on FreeBSD 6.4 through 8.1-PRERELEASE
and other platforms, allows remote
attackers to cause a denial of service (daemon crash)
or possibly execute arbitrary code via a long
username, as demonstrated by a long USER
command to the FreeBSD 8.0 ftpd.


Official Common Platform Enumeration (CPE) Dictionary 

CPE is a structured naming scheme for information technology systems, software, and 
packages. Based upon the generic syntax for Uniform Resource Identifiers (URI), CPE 
includes a formal name format, a method for checking names against a system, and a 
description format for binding text and tests to a name. 

Below is the current official version of the CPE Product Dictionary. The dictionary provides 
an agreed upon list of official CPE names. The dictionary is provided in XML format and is 
available to the general public. Please check back frequently as the CPE Product Dictionary 
will continue to grow to include all past, present and future product releases. The CPE 
Dictionary is updated nightly when modifications or new names are added. Archived CPE 
dictionaries are available at http://static.nvd.nist.gov/feeds/xml/cpe/dictionary/.

As of December 2009, The National Vulnerability Database is now accepting contributions to
the Official CPE Dictionary. Organizations interested in submitting CPE Names should contact
the NVD CPE team at cpe_dictionary@nist.gov for help with the processing of their
submission. 

The CPE Dictionary hosted and maintained at NIST may be used by nongovernmental 
organizations on a voluntary basis and is not subject to copyright in the United States. 
Attribution would, however, be appreciated by NIST. 













DEBIAN SECURITY TRACKING



Contents of /data/CPE/list

Revision 18936
Fri Apr 13 12:05:16 2012 UTC (2 months, 3 weeks ago) by pere
File size: 55047 byte(s)

Update list of CPE entries and aliases based on CVE ids for 2011 and 2012.

a2ps;cpe:/a:gnu:a2ps
abc2ps;cpe:/a:abc2ps:abc2ps
abcm2ps;cpe:/a:jef_moine:abcm2ps
abcmidi;cpe:/a:abcmidi:abcmidi
abiword;cpe:/a:abisource:community_abiword


Contents of /data/CVE/list

Revision 19695
Sun Jul 8 21:14:29 2012 UTC (2 hours, 34 minutes ago) by joeyh
File size: 7142120 byte(s)


























AUTOMATED VULNERABILITY
INFERENCE

1. Take a CVE and match its CPE name to a Debian package.

2. Parse the CVE summary and extract the vulnerable filename.

3. Find clones of the package with a similar filename.

4. Trim dynamically linked clones.

5. Is the affected clone already being tracked?
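Step 2 can be sketched with a regular expression that picks source filenames out of the free-text summary. This is an illustration, not Clonewise's actual parser: the extension list is the one shown earlier in these slides for source files, and the pattern and function name are assumptions.

```python
import re

# Source file extensions listed earlier in these slides.
SOURCE_EXTENSIONS = ("c", "cpp", "cxx", "cc", "php", "inc", "java", "py",
                     "rb", "js", "pl", "pm", "ml", "mli", "lua")

# A word, a dot, then one of the known extensions, ending at a word boundary.
FILENAME_RE = re.compile(r"\b[\w+-]+\.(?:%s)\b" % "|".join(SOURCE_EXTENSIONS),
                         re.IGNORECASE)


def vulnerable_filenames(summary: str) -> list:
    """Return the distinct source filenames mentioned in a CVE summary."""
    return sorted(set(FILENAME_RE.findall(summary)))


summary = ("The png_decompress_chunk function in pngrutil.c in libpng 1.0.x before "
           "1.0.53, 1.2.x before 1.2.43, and 1.4.x before 1.4.1 does not properly "
           "handle compressed ancillary-chunk data")
print(vulnerable_filenames(summary))
```

Version strings such as `1.0.x` or `1.4.1` are not picked up because the part after the dot is not a known extension. A header-only fix would be missed with this list, since `h` is not among the extensions.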


EXAMPLE 

® By CVE 



Original release date: 03/03/2010 
Last revised: 11/18/2010 
Source: US-CERT/NIST 


Overview 


The png_decompress_chunk function in pngrutil.c in libpng 1.0.x before 
1.0.53, 1.2.x before 1.2.43, and 1.4.x before 1.4.1 does not properly 
handle compressed ancillary-chunk data that has a disproportionately large 
uncompressed representation, which allows remote attackers to cause a 
denial of service (memory and CPU consumption, and application hang) via a 
crafted PNG file, as demonstrated by use of the deflate compression 
method on data composed of many occurrences of the same character, 
related to a "decompression bomb" attack. 




# summary: The png_decompress_chunk function in pngrutil.c in libpn 

# 1.0.53, 1.2.x before 1.2.43, and 1.4.x before 1.4.1 does not prop 

# compressed ancillary-chunk data that has a disproportionately lar 

# representation, which allows remote attackers to cause a denial o 

# (memory and CPU consumption, and application hang) via a crafted 

# demonstrated by use of the deflate compression method on data com 

# occurrences of the same character, related to a "decompression bo 


# cve-2010-0205 relates to a vulnerability in package libpng.

# The following source filenames are likely responsible: 

# pngrutil.c 

# The following package clones are tracked in the embedded-code-cop 

# database. They have not been fixed. 


libpng cloned_in_source ia32-libs <unfixed> cve-2010-0205 





IMPLEMENTATION AND 
EVALUATION 


IMPLEMENTATION


® 3,500 Lines of C++ and shell scripts. 

® Open Source 

http://www.github.com/silviocesare/Clonewise




FILENAMES AS FEATURES 

® Ubuntu Linux 


® 3,077,063 unique filenames. 

® Follows inverse power law distribution. 

® R-squared value of regression analysis: 0.928.



ESTABLISHING GROUND TRUTH 


® Debian Linux embedded-code-copies.txt. 

■ Not really machine readable. 

■ Cull entries which we can’t match to packages. 

■ 761 labelled positives. 

® Negatives are any packages not in the positives.

■ 475780 generated labelled negatives. 


PACKAGE CLONE DETECTION 


® Identified 34 previously unknown clones in 
Debian. 

■ Lots more to do. 

® Statistical classification 

■ Random Forest gave best accuracy. 

■ Increasing the decision threshold reduces FPs. 

■ Predict 3 FPs in 10,000 classifications. 

■ More likely an upper limit. 


ACCURACY OF STATISTICAL
CLASSIFICATION


Classifier               TP/FN     FP/TN       TP Rate   FP Rate
Naive Bayes              439/322   484/56296   57.69%    0.85%
Multilayer Perceptron    204/557   48/56732    26.81%    0.08%
C4.5                     523/238   86/56694    68.73%    0.15%
Random Forest            533/228   60/56720    70.04%    0.11%
Random Forest (Ml_       446/315   15/56765    58.61%    0.03%
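The rate columns follow directly from the counts: TP rate = TP / (TP + FN) and FP rate = FP / (FP + TN). A short check, using the counts from the table for the four classifiers whose labels are legible (the helper name `rates` is illustrative):

```python
def rates(tp: int, fn: int, fp: int, tn: int):
    """Return (true-positive rate, false-positive rate) as fractions."""
    return tp / (tp + fn), fp / (fp + tn)


# (TP, FN, FP, TN) as reported in the table.
results = {
    "Naive Bayes": (439, 322, 484, 56296),
    "Multilayer Perceptron": (204, 557, 48, 56732),
    "C4.5": (523, 238, 86, 56694),
    "Random Forest": (533, 228, 60, 56720),
}

for name, counts in results.items():
    tp_rate, fp_rate = rates(*counts)
    print(f"{name:<22} TP rate {tp_rate:.2%}  FP rate {fp_rate:.2%}")
```

Every row sums to 761 positives and 56,780 negatives, so the rates are comparable across classifiers.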





EFFICIENCY OF CLONE DETECTION

® 4 hours on an Amazon HPC cluster.

® MPI_Scatter to do static job assignment was
inefficient.

■ Better to consume from a work queue. 

® Need to use multicore to balance load. 
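A toy simulation illustrates why static assignment loses to a work queue when job sizes are uneven. It does not use MPI, and the job durations are made up; it only compares the finishing time of the slowest worker under each policy.

```python
import heapq


def static_makespan(durations, n_workers):
    """Contiguous equal-sized chunks handed out up front, as a scatter would."""
    chunk = -(-len(durations) // n_workers)  # ceiling division
    return max(sum(durations[i:i + chunk])
               for i in range(0, len(durations), chunk))


def queue_makespan(durations, n_workers):
    """Each worker pulls the next job as soon as it becomes idle."""
    finish_times = [0.0] * n_workers
    heapq.heapify(finish_times)
    for duration in durations:
        earliest = heapq.heappop(finish_times)  # the worker that frees up first
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


# Hypothetical per-package clone-detection times: a few large packages, many small.
durations = [8, 7, 6, 1, 1, 1, 1, 1]
print(static_makespan(durations, 4))
print(queue_makespan(durations, 4))
```

With four workers, static chunking puts the two largest jobs on the same worker and finishes at time 15. The queue spreads them out and finishes at time 8.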



VULNERABILITIES DETECTED


Package                  Embedded Package
boson                    lib3ds
libopenscenegraph7       lib3ds
libfreeimage             libpng
libfreeimage             libtiff
libfreeimage             openexr
r-base-core              libbz2
r-base-core-ra           libbz2
lsb-rpm                  libbz2
criticalmass             libcurl
albert                   expat
mcabber                  expat
centerim                 expat
wengophone               gaim
libpam-opie              libopie
pysol-sound-server       libmikmod
gnome-xcf-thumbnailer    xcftool
plt-scheme               libgd


Package                          Embedded Package
OpenSceneGraph                   lib3ds
mrpt-opengl                      lib3ds
mingw32-OpenSceneGraph           lib3ds
libtlen                          expat
centerim                         expat
mcabber                          expat
udunits2                         expat
libnodeupdown-backend-ganglia    expat
libwmf                           gd
kadu                             mimetex
cgit                             git
tkimg                            libpng
tkimg                            libtiff
ser                              php-Smarty
pgpoolAdmin                      php-Smarty
sepostgresql                     postgresql








DISCUSSION,
RELATED WORK,
FUTURE WORK AND
CONCLUSION


PRACTICAL CONSEQUENCES 

® Write access to Debian’s security tracker.


® Red Hat embedded code copies wiki created.


® Debian plans to integrate Clonewise into its
infrastructure.


Revision 15537
Modified Fri Oct 29 04:31:39 2010 UTC (20 months ago) by silvio-guest
File length: 57330 byte(s)


Revision 15535
Modified Thu Oct 28 02:38:42 2010 UTC (20 months ago) by silvio-guest
File length: 57277 byte(s)











REFERENCING CVES IN ADVISORIES


® Red Hat references CVEs of embedded libs.
® Not every vendor does. 

® It would be nice if CVE supported this. 


CVE-2011-3026
(under review)

Integer overflow in libpng, as used in Google Chrome before 17.0.963.56, allows
remote attackers to cause a denial of service or possibly have unspecified other
impact via unknown vectors that trigger an integer truncation.


Critical: xulrunner security update

Advisory: RHSA-2012:0143-1
Type: Security Advisory
Severity: Critical
Issued on: 2012-02-16
Last updated on: 2012-02-16

Affected Products:
RHEL Desktop Workstation (v. 5 client)
Red Hat Enterprise Linux Desktop (v. 5 client)
Red Hat Enterprise Linux Desktop (v. 6)
Red Hat Enterprise Linux HPC Node (v. 6)
Red Hat Enterprise Linux Server (v. 6)
Red Hat Enterprise Linux Server AUS (v. 6.2)
Red Hat Enterprise Linux Server EUS (v. 6.2.z)
Red Hat Enterprise Linux Workstation (v. 6)

CVEs (cve.mitre.org): CVE-2011-3026











EMBEDDED CODE COPIES VERSUS
CODE REUSE

® Clonewise detects code reuse. 

® If zlib embedded in packages X and Y: 

■ Clonewise detects clones between all X, Y, and zlib. 

® What we really want to know is: 

■ X is not cloned in Y. 

■ zlib is cloned in X and Y.

® Mitigation 

■ Clone detection on known embedded libraries. 


RELATED WORK

® Debian Linux zlib audit in 2005 


® Plagiarism detection 

■ Attribute counting 

■ Structure-based 

® Code clone detection 

■ Tokenization 

■ Abstract syntax trees 













































FUTURE WORK


® Source repositories 

■ Sourceforge 

■ Github 

® Other OSs - BSD etc 





® Integration into build/packaging systems? 

® Integration into Debian Linux infrastructure. 


WWW.FOOCODECHU.COM

® More than just Clonewise.. 

® Simseer - Free flowgraph-based malware similarity and 
families. 

® 110,000 LOC C++. Happy to talk to vendors. 



























CONCLUSION 

® Vendors have 10,000+ packages.


® How to audit for clones? 


® Clonewise can provide a solution. 
® And help improve security. 

® http://www.FooCodeChu.com 



Remember to complete the Black Hat speaker feedback survey. 




